Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Yingxin Zhang, UID: 006140202

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.7.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.7    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:   78.672 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 yingxin  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 yingxin  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 yingxin  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 yingxin  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 yingxin  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 yingxin  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 yingxin  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 yingxin  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 yingxin  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 yingxin  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 yingxin  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 yingxin  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 yingxin  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 yingxin  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 yingxin  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 yingxin  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 yingxin  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 yingxin  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 yingxin  staff    36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 yingxin  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 yingxin  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 yingxin  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 yingxin  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer:

  • According to the output of the codes below, fread is the fastest, only using about 0.5s. read_csv is the second fastest, using about 0.7s. read.csv is the slowest, using about 1.9s.

  • The parsed data types are different for read_csv, read.csv and fread. For example, in the parsed data types for read_csv, the subject_id is numeric, while in the parsed data types for fread and read.csv, the subject_id is integer. In the parsed data types for fread and read_csv, the admittime is POSIXct, while in the parsed data types for read.csv, the admittime is character.

  • The memory usage of read_csv, read.csv and fread are 55.31 MB, 158.71 MB, 50.13 MB respectively. Among them, fread uses the least memory.

(1)speed

# the speed of read_csv
system.time(df1 <- read_csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  1.510   1.159   0.951 
# the speed of fread
system.time(df2 <- fread("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  0.467   0.132   0.615 
# the speed of read.csv
system.time(df3 <- read.csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  3.233   0.054   3.295 

(2)data types

# the data types of read_csv
str(df1)
spc_tbl_ [431,231 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id          : num [1:431231] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
 $ hadm_id             : num [1:431231] 22595853 22841357 25742920 29079034 25022803 ...
 $ admittime           : POSIXct[1:431231], format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct[1:431231], format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct[1:431231], format: NA NA ...
 $ admission_type      : chr [1:431231] "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr [1:431231] "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr [1:431231] "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr [1:431231] "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr [1:431231] "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr [1:431231] "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr [1:431231] "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr [1:431231] "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct[1:431231], format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct[1:431231], format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: num [1:431231] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   hadm_id = col_double(),
  ..   admittime = col_datetime(format = ""),
  ..   dischtime = col_datetime(format = ""),
  ..   deathtime = col_datetime(format = ""),
  ..   admission_type = col_character(),
  ..   admit_provider_id = col_character(),
  ..   admission_location = col_character(),
  ..   discharge_location = col_character(),
  ..   insurance = col_character(),
  ..   language = col_character(),
  ..   marital_status = col_character(),
  ..   race = col_character(),
  ..   edregtime = col_datetime(format = ""),
  ..   edouttime = col_datetime(format = ""),
  ..   hospital_expire_flag = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
# the data types of fread
str(df2)
Classes 'data.table' and 'data.frame':  431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : POSIXct, format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct, format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct, format: NA NA ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct, format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct, format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 
# the data types of read.csv
str(df3)
'data.frame':   431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : chr  "2180-05-06 22:23:00" "2180-06-26 18:27:00" "2180-08-05 23:44:00" "2180-07-23 12:35:00" ...
 $ dischtime           : chr  "2180-05-07 17:15:00" "2180-06-27 18:49:00" "2180-08-07 17:50:00" "2180-07-25 17:55:00" ...
 $ deathtime           : chr  "" "" "" "" ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : chr  "2180-05-06 19:17:00" "2180-06-26 15:54:00" "2180-08-05 20:58:00" "2180-07-23 05:54:00" ...
 $ edouttime           : chr  "2180-05-06 23:30:00" "2180-06-26 21:31:00" "2180-08-06 01:44:00" "2180-07-23 14:00:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...

(3)memory

# the memory usage of read_csv
object_size(df1)
55.31 MB
# the memory usage of fread
object_size(df2)
50.13 MB
# the memory usage of read.csv
object_size(df3)
158.71 MB

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Answer: The run time is about 0.7s and does not change much. The result tibble uses the same amount of memory.

system.time(read_csv("~/mimic/hosp/admissions.csv.gz", col_types = cols()))
   user  system elapsed 
  1.406   1.247   0.622 
object_size(read_csv("~/mimic/hosp/admissions.csv.gz", col_types = cols()))
55.31 MB

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 yingxin  staff  1939088924 Jan  5  2023 /Users/yingxin/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

Answer: The program takes more than 5 minutes on my computer, so I aborted the program and set the eval option to false for the following code chunk.

system.time(read_csv("~/mimic/hosp/labevents.csv.gz"))

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Answer: This alleviates the ingestion issue. It takes about 2 minutes for read_csv to ingest the 4 columns labevents.csv.gz.

system.time(read_csv("~/mimic/hosp/labevents.csv.gz", 
            col_select = c("subject_id", "itemid", "charttime", "valuenum")))
   user  system elapsed 
126.172 100.847 119.186 

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Answer: The Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz is displayed below.

#display the first 10 lines of labevents.csv.gz
zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,
zcat < ~/mimic/hosp/labevents.csv.gz | awk -F, '($5 == 50912 || $5 == 50971 || $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || $5 == 51301 || $5 == 50931) {print $2","$5","$7","$10}' | gzip > labevents_filtered.csv.gz

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer: The first 10 lines of labevents_filtered.csv.gz are displayed below. The number of lines in labevents_filtered.csv.gz is 24855909. It takes about 9s for read_csv to ingest labevents_filtered.csv.gz.

# display the first 10 lines of labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | head -10
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
# the number of lines in labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | wc -l
zcat: (stdin): unexpected end of file
 13802844
system.time(read_csv("labevents_filtered.csv.gz"))
   user  system elapsed 
 11.465  11.721   3.756 

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Answer: The Bash command to decompress labevents.csv.gz to labevents.csv is displayed below. To save render time, I put #| eval: false at the beginning of this code chunk and upload the labevents.csv file to the github.

zcat < ~/mimic/hosp/labevents.csv.gz > labevents.csv

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Answer: The ingest+select+filter process takes about 0.5s. The number of rows in the result tibble is 24855909, which is the same as that in Q2.3. The first 10 rows of the result tibble are also the same as those in Q2.3 (The charttime is different because of the different time zone).

system.time(
  labevents_filter <- arrow::open_dataset("labevents.csv", format = 'csv') |> 
  dplyr::select(subject_id, itemid, charttime, valuenum) |> 
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)))
   user  system elapsed 
  0.052   0.015   0.093 
labevents_filter |>  
  as_tibble() |> 
  nrow()
[1] 24855909
labevents_filter |> 
  as_tibble() |> 
  head(10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Answer: Apache Arrow is a cross-language development platform for in-memory data that specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It is designed to accelerate big data processing and analytics by providing a standard in-memory representation of data that works across different systems.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Answer: The Parquet file labevents.parquet is 1.9G. The ingest+select+filter process of the Parquet file takes about 0.3s. The number of rows in the result tibble is 24855909, which is the same as that in Q2.3 and Q2.4. The first 10 rows of the result tibble are the same as those in Q2.3 (The charttime is different because of the different time zone).

# Re-write the csv file `labevents.csv` in the binary Parquet format
lab_csv <- arrow::open_dataset("labevents.csv", format = 'csv')
arrow::write_dataset(lab_csv, "labevents.parquet")
# How large is the Parquet file(s)?
ls -lh labevents.parquet
total 4063240
-rw-r--r--  1 yingxin  staff   1.9G Feb  8 22:27 part-0.parquet
# How long does the ingest+select+filter process of the Parquet file(s) take?
system.time(
  lab_parquet <- arrow::open_dataset("labevents.parquet", format = 'parquet') |> 
  dplyr::select(subject_id, itemid, charttime, valuenum) |> 
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)))
   user  system elapsed 
  0.134   0.011   0.146 
# Display the number of rows of the result tibble
lab_parquet |> 
  as_tibble() |> 
  nrow()
[1] 24855909
# Display the first 10 rows of the result tibble
lab_parquet |> 
  as_tibble() |> 
  head(10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Answer: The Parquet format is a columnar storage file format that is optimized for reading and writing large datasets. Just like how a courier service uses standardized packages to deliver goods efficiently, Apache Arrow uses a standardized format for data, making it very fast for different programs to exchange and work with data, no matter what language or system they’re using.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Answer: The ingest+convert+select+filter process takes about 0.3s. The number of rows in the result tibble is 24855909, which is the same as that in Q2.3, Q2.4 and Q2.5. The first 10 rows of the result tibble are also the same as those in Q2.3.

# Ingest the Parquet file, convert it to a DuckDB table, select columns, and filter rows
system.time(
  lab_DuckDB <- arrow::open_dataset("labevents.parquet") |> 
  arrow::to_duckdb() |> 
  dplyr::select(subject_id, itemid, charttime, valuenum) |> 
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
)
   user  system elapsed 
  0.341   0.040   0.726 
# Display the number of rows of the result tibble
lab_DuckDB |> 
  as_tibble() |> 
  nrow()
[1] 24855909
# Display the first 10 rows of the result tibble
lab_DuckDB |> 
  as_tibble() |> 
  head(10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10005606  50882 2143-12-07 03:55:00     30  
 2   10005606  50902 2143-12-07 03:55:00     99  
 3   10005606  50912 2143-12-07 03:55:00      0.6
 4   10005606  50931 2143-12-07 03:55:00    169  
 5   10005606  50971 2143-12-07 03:55:00      4.6
 6   10005606  50983 2143-12-07 03:55:00    138  
 7   10005606  51221 2143-12-07 12:47:00     24  
 8   10005606  51301 2143-12-07 12:47:00     21.5
 9   10005606  51221 2143-12-08 02:16:00     21.7
10   10005606  51301 2143-12-08 02:16:00     15.7

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator. Answer: DuckDB is a special kind of database that’s really fast and efficient at storing and retrieving information. Just like how a filing cabinet helps you organize and quickly find documents, DuckDB helps computers store and retrieve data in a way that’s easy, fast, and doesn’t take up too much space.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer: The Bash command to retrieve a subset of chartevents.csv.gz only containing these items is displayed below. The number of rows of the result tibble is 22502319, and the first 10 rows of the result tibble are also displayed below.

# Retrieve a subset of `chartevents.csv.gz` only containing these items
#| eval: false
zcat < ~/mimic/icu/chartevents.csv.gz | awk -F, '($7 == 220045 || $7 == 220181 || $7 == 220179 || $7 == 223761 || $7 == 220210)' | gzip > chartevents_filtered.csv.gz
# Display the number of rows of the result tibble
zcat < chartevents_filtered.csv.gz | wc -l
 22502319
# Display the first 10 rows of the result tibble
zcat < ./chartevents_filtered.csv.gz | head -10
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220045,97,97,bpm,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220179,93,93,mmHg,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220181,56,56,mmHg,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220210,16,16,insp/min,0